The goal of this data exploration is to figure out which chemical characteristics influence red wine quality. Which properties make a red wine good?
names(rw)
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
summary(rw)
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Each expert graded the wine quality as a discrete number between 0 (very bad) and 10 (excellent). In this data set the grades range from 3 to 8, with a median of 6.
## Warning: position_stack requires constant width: output may be incorrect
Long-tailed, skewed features can be transformed toward a more normal distribution with a square root or log function. As an example, take the “sulphates” feature:
Both transformations look better than the original (closer to a normal distribution), but the log-transformed feature looks the most normal.
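The comparison can also be checked numerically. The sketch below is only an illustration: it uses a synthetic log-normal sample as a stand-in for the real “sulphates” column, a hand-written `skewness()` helper, and base-10 logs (the base used in the original plots is not stated, so that choice is an assumption).

```r
# Synthetic stand-in for the sulphates column (an assumption, not the real data):
# a right-skewed, log-normal sample of the same size as the data set.
set.seed(42)
sulphates <- rlnorm(1599, meanlog = log(0.62), sdlog = 0.4)

# Sample skewness: third standardized moment (0 for a symmetric distribution).
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Compare the raw feature with its square root and log transforms.
skews <- c(
  original = skewness(sulphates),
  sqrt     = skewness(sqrt(sulphates)),
  log      = skewness(log10(sulphates))
)
print(round(skews, 3))
```

A skewness closer to zero means a more symmetric histogram, which is what the eye picks up when comparing the three plots.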
This “Red Wine” data set contains 1,599 observations with 11 variables (features) on the chemical properties of the wine.
For more information, read [Cortez et al., 2009].
Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add 'freshness' and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops, it's rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
The main feature of interest in the data set is “quality”. I’d like to determine which features are best for predicting the quality of a wine. I suspect “volatile acidity”, “pH”, and some combination of the other variables can be used to build a predictive model of “quality”.
I suspect the “volatile acidity”, “pH”, and “residual sugar” variables can help in building a predictive model of “quality”.
I created “log_sulphates”, which transforms the feature toward a normal distribution so that it can be used more effectively in a prediction model (linear regression).
Additionally, I transformed wine quality into a categorical variable. Wine quality is a discrete value, so it can be converted from numerical to categorical data.
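A minimal sketch of these two derived variables, under stated assumptions: the rows are toy values rather than the real `rw` data, base-10 logs are assumed, and the names `quality_factor` and `quality_group` are hypothetical. The low/medium/high cut points follow the grouping (3-4, 5-6, 7-8) used for the final plots in this report.

```r
# Toy rows for illustration only; in the report `rw` is the full wine data frame.
rw <- data.frame(
  sulphates = c(0.33, 0.55, 0.62, 0.73, 1.10, 2.00),
  quality   = c(3L, 4L, 5L, 6L, 7L, 8L)
)

# Log-transform the right-skewed feature (log base is an assumption).
rw$log_sulphates <- log10(rw$sulphates)

# Treat the grade as an ordered category instead of a number.
rw$quality_factor <- factor(rw$quality, levels = 3:8, ordered = TRUE)

# Coarser grouping: low (3-4), medium (5-6), high (7-8).
rw$quality_group <- cut(
  rw$quality,
  breaks = c(2, 4, 6, 8),
  labels = c("low", "medium", "high"),
  ordered_result = TRUE
)

print(rw)
```

Keeping the numeric `quality` column alongside the factor versions lets the same data frame feed both the correlation analysis and the grouped plots.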
This data set is tidy, so no data wrangling was needed. However, the histograms of fixed acidity, citric acid, free sulfur dioxide, total sulfur dioxide, sulphates, and alcohol are all right-skewed with a long tail. I applied log/sqrt transformations to better understand the data.
Matrix plots show the relationships between variables at a glance. We look for correlations between wine quality and each of the other properties.
The five features most strongly correlated with “quality” are:
| Feature | r-value |
|---|---|
| alcohol | 0.476 |
| volatile.acidity | -0.391 |
| sulphates | 0.251 |
| citric.acid | 0.226 |
| total.sulfur.dioxide | -0.185 |
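A ranking like this one can be produced with a few lines of base R. The sketch below is not the original code: it runs on synthetic stand-in data in which quality is deliberately built from two features plus noise, so the printed r-values will not match the table above.

```r
# Synthetic stand-in data (assumption): quality is built from two features plus noise.
set.seed(1)
n <- 500
rw <- data.frame(
  alcohol          = rnorm(n, mean = 10.4, sd = 1.0),
  volatile.acidity = rnorm(n, mean = 0.53, sd = 0.18),
  pH               = rnorm(n, mean = 3.31, sd = 0.15)
)
rw$quality <- 5.6 + 0.5 * (rw$alcohol - 10.4) -
  2.0 * (rw$volatile.acidity - 0.53) + rnorm(n, sd = 0.6)

# Pearson correlation of every feature with quality.
features <- setdiff(names(rw), "quality")
r_values <- sapply(rw[features], function(x) cor(x, rw$quality))

# Rank by absolute strength, keeping the sign.
ranked <- r_values[order(abs(r_values), decreasing = TRUE)]
print(round(ranked, 3))
```

Sorting on the absolute value matters here because negative correlations, such as the one for volatile.acidity, are as informative as positive ones.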
The “alcohol” feature has the strongest correlation with wine quality. Higher-quality wines tend to have a higher alcohol content.
There is a moderate positive relationship between alcohol and quality (r = 0.476). The other features did not seem to affect quality as much as alcohol.
There are weaker but still noticeable relationships between quality and volatile.acidity, sulphates, citric.acid, and total.sulfur.dioxide. The remaining features did not seem to affect quality.
The strongest relationship is with alcohol, with an r-value of 0.476.
To add more variables to the analysis, we use color as an additional dimension. There are 5 main features. Let’s start with the first two, “alcohol” and “volatile acidity”.
We can clearly see that higher-quality wines have higher alcohol and lower volatile acidity.
Higher-quality wines have higher alcohol (x-axis), lower volatile acidity (y-axis), and higher sulphates (red color).
Most high-quality wines contain between 0.25 and 0.75 g/dm^3 of citric acid. We can see that higher-quality wines have higher alcohol (x-axis), lower citric acid (y-axis), and lower total sulfur dioxide (purple color).
Wine quality correlates most strongly with alcohol, followed by four other variables: “volatile.acidity”, “sulphates”, “citric.acid”, and “total.sulfur.dioxide”.
The relationship between quality and alcohol looks linear.
A multivariable linear model was created to predict wine quality from the chemical properties. Features were added in order of the strength of their correlation with wine quality.
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = rw)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = rw)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + sulphates,
## data = rw)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + sulphates,
## data = rw)
## m5: lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## citric.acid, data = rw)
## m6: lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## citric.acid + total.sulfur.dioxide, data = rw)
##
## =================================================================================
## m1 m2 m3 m4 m5 m6
## ---------------------------------------------------------------------------------
## (Intercept) 1.875*** 3.095*** 2.611*** 2.611*** 2.646*** 2.843***
## (0.175) (0.184) (0.196) (0.196) (0.201) (0.205)
## alcohol 0.361*** 0.314*** 0.309*** 0.309*** 0.309*** 0.295***
## (0.017) (0.016) (0.016) (0.016) (0.016) (0.016)
## volatile.acidity -1.384*** -1.221*** -1.221*** -1.265*** -1.222***
## (0.095) (0.097) (0.097) (0.113) (0.112)
## sulphates 0.679*** 0.679*** 0.696*** 0.721***
## (0.101) (0.101) (0.103) (0.103)
## citric.acid -0.079 -0.043
## (0.104) (0.104)
## total.sulfur.dioxide -0.002***
## (0.001)
## ---------------------------------------------------------------------------------
## R-squared 0.227 0.317 0.336 0.336 0.336 0.344
## adj. R-squared 0.226 0.316 0.335 0.335 0.334 0.342
## sigma 0.710 0.668 0.659 0.659 0.659 0.655
## F 468.267 370.379 268.912 268.912 201.777 166.962
## p 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -1721.057 -1621.814 -1599.384 -1599.384 -1599.093 -1589.749
## Deviance 805.870 711.796 692.105 692.105 691.852 683.814
## AIC 3448.114 3251.628 3208.768 3208.768 3210.186 3193.499
## BIC 3464.245 3273.136 3235.654 3235.654 3242.448 3231.138
## N 1599 1599 1599 1599 1599 1599
## =================================================================================
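The step-by-step model building behind such a table can be sketched as follows. This is an illustration under assumptions: the data is synthetic (only the column names follow the report), only three of the models are built, and plain base R is used for the comparison instead of whatever table helper produced the output above.

```r
# Synthetic stand-in data (assumption); column names follow the report.
set.seed(7)
n <- 300
rw <- data.frame(
  alcohol          = rnorm(n, 10.4, 1.0),
  volatile.acidity = rnorm(n, 0.53, 0.18),
  sulphates        = rnorm(n, 0.66, 0.17)
)
rw$quality <- 2.6 + 0.3 * rw$alcohol - 1.2 * rw$volatile.acidity +
  0.7 * rw$sulphates + rnorm(n, sd = 0.65)

# Add one predictor at a time, strongest correlation first.
m1 <- lm(quality ~ alcohol, data = rw)
m2 <- update(m1, . ~ . + volatile.acidity)
m3 <- update(m2, . ~ . + sulphates)

# Collect the fit statistics that the comparison table reports.
models <- list(m1 = m1, m2 = m2, m3 = m3)
comparison <- data.frame(
  r.squared     = sapply(models, function(m) summary(m)$r.squared),
  adj.r.squared = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC           = sapply(models, AIC)
)
print(round(comparison, 3))
```

For nested least-squares models, R-squared cannot decrease when a predictor is added, so adjusted R-squared and AIC are the more useful columns for deciding whether an extra feature is worth keeping.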
Because we do not need the model to predict quality in the future, we can use the whole data set to build the model and look at the “R-squared” value. The largest model, m6 (five features), has the highest R-squared. R-squared never decreases as features are added, although adjusted R-squared dips slightly when citric.acid is added (m5).
The model can be described as:
wine_quality = 2.843 + 0.295 × alcohol - 1.222 × volatile.acidity + 0.721 × sulphates - 0.043 × citric.acid - 0.002 × total.sulfur.dioxide
R-squared: 0.344
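As a worked example, the formula can be applied by hand to a single wine. The coefficients below are copied from the m6 column of the model table; `predict_quality()` is a hypothetical helper and the example wine is made up, with values near the medians of the data set.

```r
# Coefficients copied from the m6 column of the model table.
m6_coefs <- c(
  intercept            =  2.843,
  alcohol              =  0.295,
  volatile.acidity     = -1.222,
  sulphates            =  0.721,
  citric.acid          = -0.043,
  total.sulfur.dioxide = -0.002
)

# predict_quality() is a hypothetical helper, not part of the original analysis.
predict_quality <- function(wine, coefs = m6_coefs) {
  features <- setdiff(names(coefs), "intercept")
  coefs[["intercept"]] + sum(coefs[features] * wine[features])
}

# Made-up wine with values near the medians of the data set.
example_wine <- c(
  alcohol = 10, volatile.acidity = 0.5, sulphates = 0.6,
  citric.acid = 0.25, total.sulfur.dioxide = 40
)
predicted <- predict_quality(example_wine)
print(round(predicted, 3))  # 5.524
```

The predicted grade of about 5.5 sits between the two most common grades, 5 and 6, which is what one would expect for a typical wine.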
I think this R-squared is not good enough, and the model probably can’t be used in a production system. We should try another model, such as a binomial regression model.
We can clearly see that the distribution of wine quality is uneven. The data has many wines of medium quality (grades 5 and 6) but far fewer of low (grades 3 and 4) and high (grades 7 and 8) quality.
The 5 features with the highest correlation with quality are alcohol, volatile acidity, sulphates, citric acid, and total sulfur dioxide. Wine quality is grouped into low (3, 4), medium (5, 6), and high (7, 8). High-quality wines have a high alcohol level; however, there is no significant difference between medium- and low-quality wines. Citric acid and sulphates increase as wine quality increases. Volatile acidity decreases as wine quality increases.
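A grouped summary like this can be reproduced numerically with `aggregate()`. The rows below are toy values chosen only to illustrate the mechanics; they are not taken from the data set, and the column name `quality_group` is an assumption.

```r
# Toy rows for illustration only (not taken from the data set).
rw <- data.frame(
  quality          = c(3, 4, 5, 5, 6, 6, 7, 8),
  alcohol          = c(9.8, 10.2, 9.6, 10.0, 10.4, 10.8, 11.6, 12.2),
  volatile.acidity = c(0.90, 0.70, 0.60, 0.56, 0.50, 0.48, 0.40, 0.36)
)

# Group the grades into low (3-4), medium (5-6), and high (7-8).
rw$quality_group <- cut(rw$quality, breaks = c(2, 4, 6, 8),
                        labels = c("low", "medium", "high"))

# Mean of each feature within each quality group.
group_means <- aggregate(
  cbind(alcohol, volatile.acidity) ~ quality_group,
  data = rw, FUN = mean
)
print(group_means)
```

Comparing group means side by side makes it easier to see which features separate only the high group and which change steadily across all three groups.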
Scatter plots of the top 4 features. Two features are plotted, with color indicating wine quality. The same trend as in the last figure can be observed. In general, high-quality wines tend to have higher alcohol and lower volatile acidity content. They also tend to have higher sulphate and citric acid content.
The red wine data set contains information about 1,599 red wines. I started out with univariate analysis. I analysed the impact of the alcohol, volatile.acidity, sulphates, citric.acid, and total.sulfur.dioxide features on the quality of the red wines. I found a few interesting results, especially with respect to the impact of alcohol on the quality of the wines.
Then, I moved to bivariate analysis. I tried various combinations of the variables in the data set and tried to analyse their impact on the quality of the wines. After that, I used various techniques of multivariate analysis to analyse the impact of the variables on the red wines.
I created a linear model and included it in my analysis, but I think it should not be used in a production system because of its small R-squared.
For future exploration of this data, I would like to take one category of wine (for example, quality level 7 or 8) and look at the patterns that appear at each quality level. Additionally, it would be good to get more features about red wine.
EDA is really exciting, but it can take a huge amount of time.